ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

Authors

  • David Koslicki
  • Saikat Chatterjee
  • Damon Shahrivar
  • Alan W. Walker
  • Suzanna C. Francis
  • Louise J. Fraser
  • Mikko Vehkaperä
  • Yueheng Lan
  • Jukka Corander
  • Jonathan H. Badger
Abstract

MOTIVATION Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

RESULTS There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step: a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

AVAILABILITY An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
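The pre-processing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Julia or Matlab implementation. It assumes that each read is summarized by its normalized k-mer frequency vector, that a plain Lloyd-style K-means groups those vectors, and that the final estimate is the cluster-size-weighted sum of per-cluster estimates; this weighting is one reading of the mixture density argument, not a detail stated in the abstract. All function names are hypothetical, and `estimator` stands in for any downstream k-mer-based method.

```python
import itertools
import random


def kmer_frequencies(read, k=2):
    """Return the normalized k-mer frequency vector of one read (length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans_labels(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)  # needs n_clusters <= len(vectors)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Update step: recompute each centroid from the current labels.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
        # Assignment step: move every vector to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
    return labels


def aggregate_reads(reads, n_clusters, k=2, seed=0):
    """Return (weight, mean k-mer vector) for every non-empty cluster of reads."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels = kmeans_labels(vectors, n_clusters, seed=seed)
    clusters = []
    for c in range(n_clusters):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if members:
            clusters.append((len(members) / len(vectors), mean_vector(members)))
    return clusters


def ark_style_estimate(reads, estimator, n_clusters, k=2, seed=0):
    """Assumed combination rule: weight each per-cluster estimate by cluster size."""
    weighted = [(w, estimator(m)) for w, m in aggregate_reads(reads, n_clusters, k, seed)]
    dim = len(weighted[0][1])
    return [sum(w * est[i] for w, est in weighted) for i in range(dim)]
```

With `n_clusters=1` the sketch reduces to the single-summary case, where one k-mer frequency vector for the whole sample is passed to `estimator`.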

Similar resources

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition

MOTIVATION Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consum...

Confidence Interval Estimation of the Mean of Stationary Stochastic Processes: a Comparison of Batch Means and Weighted Batch Means Approach (TECHNICAL NOTE)

Suppose that we have one run of n observations of a stochastic process obtained by computer simulation and would like to construct a confidence interval for the steady-state mean of the process. Seeking independent observations, so that the classical statistical methods can be applied, we can divide the n observations into k batches of length m (n = k·m) or alternatively, transform the cor...

The Effect of Microstructure on Estimation of the Fracture Toughness (KIC) Rotor Steel Using Charpy Absorbed Energy (CVN)

The proportional relationships between the Charpy absorbed energy (CVN) and the KIC values have been established for a wide variety of steels. Several formulae have been proposed that predict KIC from CVN. The purpose of this study is to investigate, by means of compact testing fracture toughness specimens, the effective role of microstructure for estimation of the fractur...

The receptor tyrosine kinase ARK mediates cell aggregation by homophilic binding.

The ARK (AXL, UFO) receptor is a member of a new family of receptor tyrosine kinases whose extracellular domain contains a combination of fibronectin type III and immunoglobulin motifs similar to those found in many cell adhesion molecules. ARK mRNA is expressed at high levels in the mouse brain, prevalently in the hippocampus and cerebellum, and this pattern of expression resembles that of adh...

Zonation of bacterioplankton communities along aging upwelled water in the northern Benguela upwelling

Upwelling areas are shaped by enhanced primary production in surface waters, accompanied by a well-investigated planktonic succession. Although bacteria play an important role in biogeochemical cycles of upwelling systems, little is known about bacterial community composition and its development during upwelling events. The aim of this study was to investigate the succession of bacterial assemb...

Volume 10

Publication date: 2015